This is my first attempt at a blog post based on a Tidy Tuesday dataset. This week's Tidy Tuesday dataset is the complete set of dialogue from The Office (US), a show that my 15-year-old son and I have binged several times over. We are not quite at the point of having memorized all the dialogue, but we can see that point from here.

I really Schruted it

As I was about to post this article, I realized that the Tidy Tuesday repository contained not only a link to the schrute package, but also a link to the article in The Pudding by Caitlyn Ralph. Caitlyn used a different dataset, officequotes.net, than either the schrute package (see below for a discussion of its problems) or the file I found created by Abhinav Ralhan. I think Caitlyn’s careful analysis was made easier by locating cleaner data, and I wish I had used the same starting point.

Creed Bratton, Quality Assurance

In December, the schrute package was released by Brad Lindblad, and I happily began a holiday project of combing through the data contained within. However, I immediately found some problems with the data. For example, let’s look at one of the most (in)famous episodes of The Office, “Dinner Party” (season 4, episode 13):

library(tidyverse)
library(schrute)
theoffice %>%
    filter(episode_name == "Dinner Party") %>%
    select(index, episode, character, text) %>%
    head(10)
index episode character text
16791 13 Stanley This is ridiculous.
16792 13 Michael MAN! I WOULD LOVE TO BURN YOUR CANDLES!
16793 13 Phyllis Do you have any idea what time we’ll get out of here?
16794 13 Jan YOU BURN IT. YOU BUY IT!
16795 13 Michael Nobody likes to work late, least of all me. Do you have plans tonight?
16796 13 Michael OH GOOD. I’LL BE YOUR FIRST CUSTOMER!
16797 13 Jim Nope I don’t, remember when you told us not to make plans ’cause we’re working.
16798 13 Jan AND YOU’RE HARDLY MY FIRST!
16799 13 Michael Yes I remember. Mmm, this is B.S. This is B.S. Why are we here? I am going to call corporate. Enough is enough, I’m - God, I’m so mad! This is Michael Scott, Scranton, well we don’t want to work. No we don’t! It’s not fair to these people. These people are my friends and I care about them! We’re not going to do it! Everybody I just got off the horn with corporate and basically I told them where they could stick their little overtime assignment. Go enjoy your Friday.
16800 13 Michael THAT’S WHAT SHE SAID! THAT IS A 200 DOLLAR PLASMA SCREEN TV YOU JUST KILLED! Good luck paying me back on your zero dollars a year salary plus benefits, babe!

As anyone who has seen that supremely cringey episode knows, the dialogue in all caps is a climactic confrontation between Michael and Jan. For some reason, the index of dialogue lines in the schrute dataset, which should be chronological, has interleaved the fight with dialogue earlier in the episode. This is a common but inconsistent problem with the dialogue in the file. The index variable is fatally flawed, with no evident remedy to order the lines correctly.

Thanks to Brad Lindblad for the work he put into the schrute package. However, I was not going to be able to do the analyses I wanted to do with it.

Dunder Mifflin Infinity

When I first tried out the schrute package in December, I realized it could not provide the information I was hoping for. After searching online for an alternative, I came upon this blog entry by Abhinav Ralhan. The file posted on that page (which I had to download and then import) does have lines in the correct order. It also combines the double episodes, so “Dinner Party” is now season 4, episode 9. We can also select by scene:

office_raw <- read_csv("the-office-lines - scripts.csv") %>%
      select(id:scene, deleted, speaker, line_text)
office_raw %>% filter(season == 4, episode == 9, scene == 19) %>%
      head(12)
id season episode scene deleted speaker line_text
20687 4 9 19 FALSE Jan [Michael dips his steak into his wine] Can you not do that? It’s disgusting.
20688 4 9 19 FALSE Michael You know I have soft teeth, how can you say that?
20689 4 9 19 FALSE Jan Oops.
20690 4 9 19 FALSE Michael Excuse me for a second. [gets up from the table]
20691 4 9 19 FALSE Jim [to babysitter] So… how do you guys know each other?
20692 4 9 19 FALSE Woman I was his babysitter.
20693 4 9 19 FALSE Pam And now you guys are dating?
20694 4 9 19 FALSE Dwight Purely carnal and that’s all you need to know.
20695 4 9 19 FALSE Jim Would you write down your e-mail because I have just so many questions…
20696 4 9 19 FALSE Woman E-mail?
20697 4 9 19 FALSE Jim Nevermind.
20698 4 9 19 FALSE Michael Ok… alright… here we go. [takes down huge painting behind his seat and puts up a neon beer sign] There. [plugs it in] Oooookay.

However, it has other problems. For instance, the dialogue for the later seasons contains a large number of garbled characters.

office_raw %>% 
      filter(season == 9, episode == 12) %>%
      select(line_text) %>%
      head(10)
line_text
Gotta clear out these file cabinets people, a lot of these are dead accounts. ���Scranton Mimeograph Corp?��� I don���t think we���re doing business with them any time soon. That���s odd. ��A letter from Robert Dunder. ���A valuable artifact has come into my possession. I have hidden it until such time as a person of strong intellect may safely recover it. This golden chalice is of immeasurable historical and religious significance.��� The Holy Grail.
[on phone]: Did you send Dwight on a quest for the Holy Grail?
I think I���m a little too busy these days to s— [whispering] Oh ,my God. I did send Dwight on a quest for the Holy Grail.
The Dunder Code! I completely forgot about that prank. That had to be like six or seven years ago. Stayed late every night for a month. Had a lot more free time back then.
I don���t get it.
Aha! A lightbulb.
A lightbul–
A lightbulb. Okay. Okay. [holding note over lamp] Invisible ink.
Whoa.
���Higher than numbers go.��� The ceiling above accounting!

I could not find an easy way to convert the bad characters to the punctuation marks they replaced. The best I could do is change them to a benign character that is unlikely to be part of dialogue, and I chose a tilde (~). The office_raw dataframe will be put through a couple of necessary cleanup steps as part of the import process.

# problem character � replaced with tilde to be more easily repaired
bad_char <- "�"
office_raw <- read_csv("the-office-lines - scripts.csv") %>%
      select(id:scene, deleted, speaker, line_text) %>%
      mutate(line_text = str_replace_all(line_text, bad_char, "~")) %>%
      mutate(speaker = str_trim(speaker))
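The replacement step above can be tried on a made-up vector before running it on the full file. This is a minimal sketch, not part of the original import: `toy_lines` and its contents are invented, and it assumes the garbled character is U+FFFD (the Unicode replacement character), written here as the escape `\uFFFD` so the pattern stays readable even if the script is re-saved in another encoding.

```r
library(stringr)

# Assumption: the garbled character is U+FFFD (the Unicode replacement character)
bad_char <- "\uFFFD"

# Made-up examples, not rows from the dataset
toy_lines <- c(
  paste0("I don", strrep(bad_char, 3), "t get it."),
  "Whoa."
)

# Each bad character becomes one tilde, so a run of three becomes "~~~"
cleaned <- str_replace_all(toy_lines, bad_char, "~")
cleaned
```

Because every bad character maps to exactly one tilde, the length of each run is preserved, which is what makes the later `~~~` repairs possible.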

Like the schrute package, the new dataset has some misspelled character names.

office_raw %>%
      select(speaker) %>%
      str_extract_all("Mic[:alpha:]+") %>%
      table()
## .
##  Micael Micahel  Michae Michael  Michal Micheal  Michel 
##       1       4       2   12202       1       9       5
office_raw %>%
    select(speaker) %>%
    str_extract_all("Dar[:alpha:]+") %>%
    table()
## .
## Darrly  Darry Darryl  Daryl 
##      2      1   1258     37

Maybe a handful of misspelled character names is not a big deal to you. I am of a different temperament when it comes to data. The polite term is “tidy”.

I found two ways to clean up the character names. The first, naturally, was brute force. I simply made a summary table of all character names in the dataset, then looked at those that appeared infrequently, going on the assumption that many errors would occur only once or twice.

office_raw %>%
      count(speaker) %>%
      filter(n == 1) %>%
      head(10)
speaker n
(Pam’s mom) Heleen 1
[Clark and Pete are shown on screen]
Video Andy: Hey, I’m Pete, puberty is such a drag, man. And I’m Clark! I like to eat toilet paper. [Clark and Pete wave at camera] We fail! [Video shows memorial of Jerry 1
[repeats]
Andy: Fail 1
3rd Athlead Employee 1
4th Athlead Employee 1
abe 1
Actress 1
All but Oscar 1
All Girls 1
All the Men 1

Some are clearly errors (Heleen, Anglea, Carrol, Chares). Some are not. Several occur when a character is first introduced in the show and his/her full name is used, like Carol Stills. To see how daunting the cleanup process can be, let’s just see how many variations there are on minor character David Wallace:

office_raw %>%
      select(speaker) %>%
      str_extract_all("[:alpha:]+id W[:alpha:]+") %>%
      table()
## .
## Dacvid Walalce Dacvid Wallace  David Wallace  David Wallcve 
##              1              1            110              1

So I spent a long afternoon creating an endless series of search and replace functions.

office_tidy_chars <- office_raw %>%
      mutate(speaker = str_replace(speaker, "Mic[:alpha:]+", "Michael")) %>%
      mutate(speaker = str_replace(speaker, "[:alpha:]+hael", "Michael")) %>%
      mutate(speaker = str_replace(speaker, "Dight", "Dwight")) %>%
      mutate(speaker = str_replace(speaker, 
                                   "^Dwig[:alpha:]*[:punct:]*$", "Dwight")) %>%
      mutate(speaker = str_replace(speaker, "Meridith", "Meredith")) %>%
      mutate(speaker = str_replace(speaker, "Stanely", "Stanley")) %>%
      mutate(speaker = str_replace(speaker, "sAndy", "Andy")) %>%
      mutate(speaker = str_replace(speaker, "^Ang[:alpha:]+", "Angela")) %>%
      mutate(speaker = str_replace(speaker, "^Darr[:alpha:]*", "Darryl")) %>%
      mutate(speaker = str_replace(speaker, "Daryl", "Darryl")) %>%
      mutate(speaker = str_replace(speaker, "Phyl[:alpha:]*", "Phyllis")) %>%
      mutate(speaker = str_replace(speaker, "^abe$", "Gabe")) %>%
      mutate(speaker = str_replace(speaker, "Holy", "Holly")) %>%
      mutate(speaker = str_replace(speaker, "Chares", "Charles")) %>%
      mutate(speaker = str_replace(speaker,
                                  "[:alpha:]+id Wa[:alpha:]+", "David Wallace")) %>%
      mutate(speaker = str_replace(speaker, 
                                   "Denagelo|DeAngelo|DeAgnelo", "Deangelo")) %>%
      mutate(speaker = str_replace(speaker, "M Michael", "Michael")) %>%
      mutate(speaker = str_replace(speaker, "^D$", "Dwight")) %>%
      mutate(speaker = str_replace(speaker, "^Carro[:alpha:]+", "Carol")) %>%
      mutate(speaker = str_replace(speaker, "Heleen", "Helene")) %>%
      mutate(speaker = str_replace(speaker, "Mayers", "Meyers")) %>%
      mutate(speaker = str_replace(speaker, "Liptop", "Lipton")) %>%
# most of the changes below were for consistency, not to correct errors
      mutate(speaker = str_replace(speaker, "Andy/", "Andy and ")) %>%
      mutate(speaker = str_replace(speaker, "Pam/", "Pam and ")) %>%
      mutate(speaker = str_replace(speaker, "Michael/", "Michael and ")) %>%
      mutate(speaker = str_replace(speaker, "Deangelo/", "Deangelo and ")) %>%
      mutate(speaker = str_replace(speaker, "Angela/", "Angela and ")) %>%
      mutate(speaker = str_replace(speaker, 
                                   "Gabe/Kelly/Toby", "Gabe, Kelly, and Toby")) %>%
      mutate(speaker = str_replace(speaker, "David Wallace", "David")) %>%
      mutate(speaker = str_replace(speaker, "Todd Packer", "Todd")) %>%
      mutate(speaker = str_replace(speaker, "Packer", "Todd")) %>%
      mutate(speaker = str_replace(speaker, "Robert California", "Robert")) %>%
      mutate(speaker = str_replace(speaker, "^Bob$", "Bob Vance")) %>%
      mutate(speaker = str_replace(speaker, ", Vance Refrigeration", "")) %>%
      mutate(speaker = str_replace(speaker, "Irving", "Erving")) %>%
      mutate(speaker = str_replace(speaker, "^Julius$", "Julius Erving")) %>%
      mutate(speaker = str_replace(speaker, "MeeMaw", "Mee-Maw")) %>%
      mutate(speaker = str_replace(speaker, "&", "and")) %>%
      mutate(speaker = str_replace(speaker, "worker", "Worker")) %>%
      mutate(speaker = str_replace(speaker, "#", "")) %>%
      mutate(speaker = str_replace(speaker, "CameraMan", "Cameraman")) %>%
      mutate(speaker = str_replace(speaker, " Guy", " guy")) %>%
      mutate(speaker = str_replace(speaker, " Employ", " employ")) %>%
      mutate(speaker = str_replace(speaker, " Member", " member")) %>%
      mutate(speaker = str_replace(speaker, " Phone", " phone")) %>%
      mutate(speaker = str_replace(speaker, " Club", " club")) %>%
      mutate(speaker = str_replace(speaker, " Manager", " manager")) %>%
      mutate(speaker = str_replace(speaker, " Drive", " drive")) %>%
      mutate(speaker = str_replace(speaker, " Crew", " crew")) %>%
      mutate(speaker = str_replace(speaker, " Worker", " worker")) %>%
      mutate(speaker = str_replace(speaker, " Teacher", " teacher")) %>%
      mutate(speaker = str_replace(speaker, " Shareholder", " shareholder")) %>%
      mutate(speaker = str_replace(speaker, " Pregnant", " pregnant")) %>%
      mutate(speaker = str_replace(speaker, " Assistant", " assistant")) %>%
      mutate(speaker = str_replace(speaker, " Guest", " guest")) %>%
      mutate(speaker = str_replace(speaker, " Voice", " voice")) %>%
      mutate(speaker = str_replace(speaker, " Mom", " mom")) %>%
      mutate(speaker = str_replace(speaker, " Dad", " dad")) %>%
      mutate(speaker = str_replace(speaker, " Father", " father")) %>%
      mutate(speaker = str_replace(speaker, " Brother", " brother")) %>%
      mutate(speaker = str_replace(speaker, " Sister", " sister")) %>%
      mutate(speaker = str_replace(speaker, " Son", " son")) %>%
      mutate(speaker = str_replace(speaker, " Girl", " girl")) %>%
      mutate(speaker = str_replace(speaker, " Woman", " woman")) %>%
      mutate(speaker = str_replace(speaker, " Man", " man")) %>%
      mutate(speaker = str_replace(speaker, " Salesman", " salesman")) %>%
      mutate(speaker = str_replace(speaker, "^Everybody$", "Everyone"))
office_raw %>%
      distinct(speaker) %>%
      nrow()
## [1] 793
office_tidy_chars %>%
      distinct(speaker) %>%
      nrow()
## [1] 730

The 793 distinct speaker labels have been condensed to 730. I could likely do better, but that would require case-by-case inspection.
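A long chain like the one above can also be written as a lookup. The sketch below is only an illustration on made-up names: `name_fixes` and `toy_speakers` are hypothetical, and the patterns are anchored more tightly than the originals. It relies on `str_replace_all()` accepting a named vector of `pattern = replacement` pairs, which it applies in order. Note that `str_replace_all()` replaces every match in a string, whereas the chain above uses `str_replace()`, which replaces only the first.

```r
library(stringr)

# Hypothetical lookup: regex pattern = replacement, applied in order
name_fixes <- c(
  "^Mic[[:alpha:]]+$"          = "Michael",
  "^Darr[[:alpha:]]*$|^Daryl$" = "Darryl",
  "Meridith"                   = "Meredith"
)

# Made-up speaker labels, including one that needs no fix
toy_speakers <- c("Micheal", "Daryl", "Darrly", "Meridith", "Pam")

fixed <- str_replace_all(toy_speakers, name_fixes)
fixed
```

Keeping the fixes in one vector makes it easier to review, reorder, or extend them than editing a pipeline of several dozen `mutate()` calls.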

Einsteins

A few days after writing out that interminable list of commands, I asked in the R4DS Slack channel if anyone had a better way, and Scott Came told me about the fuzzyjoin package created by David Robinson. It can join tables on near-matches, an indispensable tool for hacking through an arduous task more quickly.

The first thing I did was to identify the most common characters in a dataframe called base_chars, going on the assumption that all character names appearing at least 100 times contained no misspellings. Then I rejoined it to the original data using the fuzzyjoin function stringdist_left_join and filtered for matches that did not fit perfectly.

library(fuzzyjoin)
# identify the 31 characters with 100+ lines of dialogue
base_chars <- office_raw %>%
      count(speaker) %>%
      filter(n >= 100) %>%
      select(base_char = speaker)
      
# use fuzzy left_join, then filter for closest but not identical
corr_char <- office_raw %>%
      count(speaker) %>%
      stringdist_left_join(base_chars, by = c("speaker" = "base_char"), 
                           method = "cosine",
                           max_dist = 0.2, distance_col = "Distance") %>%
      arrange(Distance, desc(n)) %>%
      filter(n < 100 & Distance <= 0.101)
corr_char %>% head(40)
speaker n base_char Distance
Micheal 9 Michael 0.0000000
Micahel 4 Michael 0.0000000
Darrly 2 Darryl 0.0000000
Stanely 2 Stanley 0.0000000
Anglea 1 Angela 0.0000000
Denagelo 1 Deangelo 0.0000000
Dacvid Walalce 1 David Wallace 0.0200421
Dacvid Wallace 1 David Wallace 0.0200421
Miichael 1 Michael 0.0438171
Phylis 2 Phyllis 0.0474207
David Wallcve 1 David Wallace 0.0488103
Daryl 37 Darryl 0.0513167
Holy 2 Holly 0.0550888
Darry 1 Darryl 0.0645857
M ichael 1 Michael 0.0645857
Michel 5 Michael 0.0741799
Michae 2 Michael 0.0741799
Chares 1 Charles 0.0741799
Dwight: 1 Dwight 0.0741799
Dwight. 1 Dwight 0.0741799
Micael 1 Michael 0.0741799
Michal 1 Michael 0.0741799
Mihael 1 Michael 0.0741799
Angel 1 Angela 0.0871291
Dight 1 Dwight 0.0871291
DeAngelo 79 Deangelo 0.1000000
Meridith 2 Meredith 0.1000000
DeAgnelo 1 Deangelo 0.1000000
sum(corr_char$n)
## [1] 163

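Using this approach, I can see that the first 28 fuzzy matches (Distance <= 0.101) identify misspelled character names. Note that the cosine method compares letter counts rather than letter order, so it returns a distance of 0 (a perfect match) for transposed letters. Correctly spelled names also match themselves at distance 0, so I kept only speakers with fewer than 100 lines to exclude them.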
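The behaviour of the cosine method on transposed letters can be checked directly. The sketch below calls `stringdist()` from the stringdist package, which, as far as I know, is what fuzzyjoin uses to compute these distances. The variable names are illustrative, and `q = 1` (single-letter profiles) is stated explicitly on the assumption that it matches the default used in the join above.

```r
library(stringdist)

# Transposed letters: same letter counts, different order
d_cos <- stringdist("Micheal", "Michael", method = "cosine", q = 1)

# An edit-based method (optimal string alignment) counts the swap as one edit
d_osa <- stringdist("Micheal", "Michael", method = "osa")

# A dropped letter changes the letter counts, so the cosine distance is above 0
d_drop <- stringdist("Phylis", "Phyllis", method = "cosine", q = 1)

c(cosine_swap = d_cos, osa_swap = d_osa, cosine_drop = d_drop)
```

The first value is 0 up to floating-point error, the second is 1, and the third is about 0.047, in line with the Phylis row in the table above. An edit-based method such as `"osa"` would therefore separate transposed spellings from exact matches without a line-count filter, at the cost of choosing a different `max_dist` scale.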

The fuzzyjoin approach rapidly diminishes the work required; it cleaned up 163 lines attributed to misspelled character names. However, 1) I still needed to manually inspect the matches to determine which were errors and which were legitimate entries, and 2) I didn’t find all the misspellings.

office_tidy_chars_fj <- office_raw %>%
      left_join(select(corr_char, speaker, base_char), by = "speaker") %>%
      mutate(speaker = coalesce(base_char, speaker)) %>%
      select(-base_char)
diag_lines_fj <- office_tidy_chars_fj %>%
      group_by(speaker) %>%
      summarize(n = n()) %>%
      arrange(desc(n))
diag_lines_fj %>%
      stringdist_left_join(base_chars, by = c("speaker" = "base_char"), 
                           method = "cosine",
                           max_dist = .5, distance_col = "Distance") %>%
      group_by(speaker) %>%
      filter(Distance > 0) %>%
      arrange(Distance, desc(n)) %>%
      head(10)
speaker n base_char Distance
Randy 2 Ryan 0.1055728
sAndy 1 Andy 0.1055728
Phyliss 1 Phyllis 0.1111111
Chelsea 1 Charles 0.1180829
Meredith’s Vet 3 Meredith 0.1235402
Deangelo/Michael 2 Deangelo 0.1317569
Joan 5 Jan 0.1339746
Rory 2 Roy 0.1339746
abe 1 Gabe 0.1339746
Molly 3 Holly 0.1428571
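The correction step at the top of the code above (a left join against the lookup table, followed by `coalesce()`) can be sketched on made-up rows. `toy_lines` and `toy_corrections` are hypothetical stand-ins for `office_raw` and `corr_char`; only the column names are taken from the post.

```r
library(dplyr)

# Made-up dialogue rows and a made-up correction table
toy_lines <- tibble(
  id      = 1:4,
  speaker = c("Michael", "Micheal", "Daryl", "Pam")
)
toy_corrections <- tibble(
  speaker   = c("Micheal", "Daryl"),
  base_char = c("Michael", "Darryl")
)

toy_fixed <- toy_lines %>%
  left_join(toy_corrections, by = "speaker") %>%
  # coalesce() keeps base_char where a correction exists, otherwise the original speaker
  mutate(speaker = coalesce(base_char, speaker)) %>%
  select(-base_char)
toy_fixed
```

The join keeps one row per line of dialogue only if each misspelling appears once in the correction table, so it is worth checking the lookup for duplicate speakers before joining.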

For a dataset that contains sitcom dialogue, some degree of error is tolerable. For one that has critically important information, for example for a scientific publication, all of the errors must be tracked down. This dataset illustrates just how difficult that task can be.

In the end, although the fuzzyjoin approach was a time-saver, I returned to the brute force approach described earlier, using nearly 70 str_replace steps. That created the office_tidy_chars dataframe. I needed to run more checks to look for anomalous entries, e.g. dialogue misclassified as character names.

office_tidy_chars %>%
      group_by(speaker) %>%
      summarize(n = n()) %>%
      arrange(desc(n)) %>%
      filter(n == 1 & str_length(speaker) > 30)
speaker n
[Clark and Pete are shown on screen]
Video Andy: Hey, I’m Pete, puberty is such a drag, man. And I’m Clark! I like to eat toilet paper. [Clark and Pete wave at camera] We fail! [Video shows memorial of Jerry 1
Andy, Creed, Kevin, Kelly, Darryl 1
Female church member [to Michael] 1
Group: Dunder Mifflin!
Andy: Andy Bernard presents: Summer Softball Epic Fails! [Kevin swings bat on screen, fart noise follows] Fail. [repeats] Fail 1
Meredith, Creed, Oscar and Matt 1
Oscar’s voice from the computer 1
Phyllis, Meredith, Michael, Kevin 1

For the three problematic cases, I found it easiest to change them manually.

# rows where dialogue ended up in the speaker column
bad_rows <- c(53564, 53565, 53573)
# columns are addressed by name: after the select() at import,
# positions 5 and 6 are deleted and speaker, not speaker and line_text
office_tidy_chars$line_text[bad_rows] <- office_tidy_chars$speaker[bad_rows]
office_tidy_chars$speaker[bad_rows] <- "Andy"

The use of the tilde to replace the bad characters in the data allowed me to repair some, though not all, of the altered words. The mutate steps reduced the number of bad lines from 2559 in office_tidy_chars to 763 in the new dataframe, office_tidy_dial.

office_tidy_dial <- office_tidy_chars %>%
      mutate(line_text = str_replace_all(line_text, "~~~s", "'s")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~d", "'d")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~t", "'t")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~m", "'m")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~ll", "'ll")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~re", "'re")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~ve", "'ve")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~em", "'em")) %>%
      mutate(line_text = str_replace_all(line_text, "~~~mon", "'mon"))
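The nine replacements above share one shape (three tildes followed by a contraction ending), so they can be collapsed into a single pattern with a backreference. This is a sketch on made-up strings rather than a drop-in replacement: `toy_text` is invented, and the trailing `\\b` is an added restriction that only repairs endings that finish a word, so its results can differ from the chain above (for example, it leaves `~~~so` untouched).

```r
library(stringr)

# Made-up lines using "~~~" where an apostrophe or quote mark was lost
toy_text <- c(
  "I don~~~t think we~~~re done.",
  "C~~~mon, that~~~s odd.",
  "~~~Higher~~~"
)

# One pattern for all contraction endings; \\b requires the ending to finish a word,
# and \\1 puts the captured ending back after the apostrophe
repaired <- str_replace_all(
  toy_text,
  "~~~(s|d|t|m|ll|re|ve|em|mon)\\b",
  "'\\1"
)
repaired
```

The third string is left alone because its tildes stand in for quotation marks rather than an apostrophe, which is the kind of case that still needs separate handling.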

One final, critical practice to make my life easier: once I have the characters and dialogue reasonably cleaned up, save it as a file so I don’t have to recreate it!

write_csv(office_tidy_dial, "office_tidy_dialogue.csv")

Take the rest of the day off

Now we can finally have some fun with the data. How many times does Pam Beesly answer the phone by saying “Dunder Mifflin, this is Pam”?

office_tidy_dial %>%
      filter(str_detect(line_text, "his is Pam") & str_detect(line_text, "Dunder"))
id season episode scene deleted speaker line_text
84 1 1 18 FALSE Pam Dunder Mifflin. This is Pam.
623 1 3 13 FALSE Pam Dunder Mifflin, this is Pam.
1144 1 4 50 TRUE Pam [telephone rings] Dunder Mifflin, this is Pam. Hold please.
1145 1 4 50 TRUE Michael All righty then, well I see you’re going for the whole bored supermodel thing. ‘Dunder Mifflin, this is Pam. May I help you?’ [takes a drag from an imaginary cigarette] Smoke, smoke, smoke, smoke.
1170 1 4 50 TRUE Pam [smiling] Dunder Mifflin, this is Pam. One moment I’ll transfer you.
3122 2 4 1 FALSE Pam Dunder Mifflin, this is Pam. Sure, can I ask who’s calling? Just a second.
3635 2 5 17 FALSE Pam [on phone] Dunder-Mifflin. This is Pam. [listens] Uh, yeah. [snaps her fingers in the air, getting Jim’s attention] Just one second. I will, uh, transfer you to our manager, Michael Scott.
4290 2 7 15 FALSE Pam Dunder-Mifflin, this is Pam.
4429 2 7 41 FALSE Pam Dunder-Mifflin, this is Pam.
5993 2 12 2 FALSE Pam Dunder Mifflin, this is Pam.
6332 2 12 35 FALSE Pam Dunder Mifflin, this is Pam.
6969 2 14 53 FALSE Pam [voicemail message for Jim] I’ll transfer you. Dunder Mifflin, this is Pam. Hold, please. Dunder Mifflin, this is … okay, sorry. Michael was standing at my desk, and I needed to be busy or who knows what would’ve happened, so thank you.
7333 2 15 52 FALSE Pam Dunder Mifflin. This is Pam. Uh… hold, please.
8752 2 20 56 TRUE Pam [telephone ringing] Dunder Mifflin. This is Pam. Um, hold, please. [to Jim] There’s a Brenda on the phone for you. [to Brenda] Just one second, I’ll transfer.
9037 2 21 60 TRUE Pam [telephone ringing] Dunder Mifflin. This is Pam. Hold, please. Dwight, it’s the Sheriff. He said that it’s really important. It’s regarding your desk. I’ll transfer.
9824 3 2 25 FALSE Pam [answering phone] Dunder-Mifflin, this is Pam. He’s not in the office. Can I take a message? I will. You too. [hangs up] Sorry. What’s up?
10275 3 3 62 FALSE Pam Dunder Mifflin, this is Pam. … Uh, sure, I’ll get him for you. [to Michael] It’s Jan for you.
10773 3 5 39 FALSE Pam Dunder Mifflin, this is Pam. Oh, hi Jan. He’s, uh, on a sales call. No message? Bye, Jan.
11260 3 7 18 FALSE Pam It’s a blessing in disguise. Actually, not even in disguise. Sometimes at home, I answer the phone, ‘Dunder-Mifflin, this is Pam.’ So, maybe that’ll stop now.
12915 3 11 21 FALSE Pam [on phone] Dunder Mifflin, this is Pam. Just a second. Michael, it’s Jan on the phone for you.
13219 3 12 23 TRUE Pam Dunder Mifflin, this is Pam. [excited] This is Pam. I did?
16925 3 23 77 FALSE Pam [phone rings, Pam answers] Dunder Mifflin, this is Pam. Just one moment, I’ll transfer you.
17218 4 1 35 FALSE Pam Michael Scott’s Dunder-Mifflin, Scranton, Meredith Palmer memorial, celebrity rabies awareness, fun run race for the cure, this is Pam.
18270 4 3 22 FALSE Pam Dunder Mifflin. This is Pam.
19471 4 5 30 FALSE Pam Dunder Mifflin, this is Pam.
21591 4 12 15 FALSE Pam No. [Kevin leaves; Pam takes off her glasses; phone rings] Dunder Mifflin, this is Pam. Okay, go ahead. [puts a notepad close to her face and writes message]
26108 5 11 1 FALSE Pam [answering the phone] Dunder Mifflin, this is Pam. I’m sorry, he’s not in yet. Would you like his voicemail?
26959 5 13 37 FALSE Pam Dunder Mifflin this is Pam. Oh, hey Mom. No, what did Dad say?
27038 5 13 58 FALSE Pam Dunder Mifflin this is Pam. Uh, I’m sorry, Michael’s not here right now can I take a message? Great. I will. Thanks.
27978 5 17 12 FALSE Pam Dunder Mifflin, this is Pam. Oh hi ,David. [Michael shakes his head to Pam] No, I’m sorry he’s not back from the Civil Rights rally. I’ll have him call you the minute he gets back from the Lincoln Memorial.
27998 5 17 14 FALSE Pam [on phone] Dunder Mifflin, this is Pam. Oh hi, David. He’s having a colonoscopy. Alright, I’ll find out if he’s out yet.
29276 5 21 38 FALSE Pam Dunder Miff…Michael Scott Paper Company, this is Pam. Oh, hi Russell from the pancake luncheon, how are you? Well we’d like to do business with you too! How can we make that happen?
42681 7 14 13 FALSE Pam [on phone] Dunder Mifflin, this is Pam.
54989 9 8 30 FALSE Pam [into phone] Hello, this is Pam Halpert. I’m calling from Dunder-Mifflin. Yes, your paper provider. And I just called to say… your mama is so fat, when she wears red, people yell, ~Hey, kool-aid.~ Yeah, your mama’s fat. This is Pam Halpert.
59858 9 23 94 FALSE Pam [answering the phone] Dunder Mifflin, This is Pam. Oh, I’m sorry. Jim Halpert doesn’t work here anymore.

I used the broader search string “his is Pam” to catch some slightly different phrasings as well as varying punctuation. However, that caught a few other uses of “this is Pam” that were not related to her answering the phone.
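One way to trim those unrelated hits is to require Pam as the speaker and the greeting at the start of the line. The sketch below runs on made-up rows that borrow the column names of the cleaned data; the `greeting` pattern is a hypothetical choice, not the search used above, and it is deliberately strict, so on the real data it would also drop genuine phone answers that come later in a line.

```r
library(dplyr)
library(stringr)

# Made-up rows with the same column names as the cleaned data
toy_dial <- tibble(
  season    = c(1, 2, 2, 3),
  speaker   = c("Pam", "Pam", "Michael", "Pam"),
  line_text = c(
    "Dunder Mifflin, this is Pam.",
    "[on phone] Dunder-Mifflin. This is Pam. Hold, please.",
    "'Dunder Mifflin, this is Pam. May I help you?'",
    "Sometimes at home, I answer the phone, 'Dunder-Mifflin, this is Pam.'"
  )
)

# Greeting at the start of the line, optionally after one [stage direction]
greeting <- regex(
  "^(\\[[^\\]]*\\]\\s*)?dunder.mifflin\\W+this is pam",
  ignore_case = TRUE
)

pam_greetings <- toy_dial %>%
  filter(speaker == "Pam", str_detect(line_text, greeting)) %>%
  count(season)
pam_greetings
```

Here the Michael impression and the line where Pam merely quotes herself are excluded, leaving one greeting each in seasons 1 and 2.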

This part was long and hard

Full list of “That’s what she said” responses and who said the previous line.

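shesaidvec <- str_which(office_tidy_dial$line_text, "hat she sai|HAT SHE SAI|hat She Sai")
# keep each matching line plus the line immediately before it
shesaid <- office_tidy_dial[shesaidvec, ] %>%
      bind_rows(office_tidy_dial[shesaidvec - 1, ]) %>%
      arrange(id)
# preview only; the full set stays in shesaid for export
shesaid %>% head(10)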
# write_csv(shesaid, "thats_what_she_said.csv")
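An alternative to binding the preceding rows back in is to carry the previous speaker and line along as extra columns with `lag()`, so each response and its trigger sit on one row. This is a sketch on made-up rows, with `prev_speaker` and `prev_line` as hypothetical column names. It only handles the case where the trigger is the immediately preceding row, so it does not cover a speaker answering an earlier part of their own line.

```r
library(dplyr)
library(stringr)

# Made-up rows in script order
toy_dial <- tibble(
  id        = 1:4,
  speaker   = c("Jan", "Michael", "Pam", "Dwight"),
  line_text = c(
    "Why is this so hard?",
    "That's what she said.",
    "Dunder Mifflin, this is Pam.",
    "THAT'S WHAT SHE SAID!"
  )
)

twss <- toy_dial %>%
  arrange(id) %>%
  # lag() pulls the speaker and text from the row immediately before each line
  mutate(prev_speaker = lag(speaker), prev_line = lag(line_text)) %>%
  # ignore_case = TRUE covers the upper-case and title-case variants in one pattern
  filter(str_detect(line_text, regex("hat she sai", ignore_case = TRUE))) %>%
  select(id, prev_speaker, prev_line, speaker, line_text)
twss
```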

In total, there were 39 instances of “that’s what she said” (thanks to Caitlyn Ralph’s article for helping me locate a couple of variants). So how could I create a reasonably tidy dataframe that would include both the response “that’s what she said” and the line that triggered it? I had two options:

  1. Write a regular expression that could grab either the prior line or an earlier line within the same string of dialogue.
  2. Say “this is only 39 cases” and do all the cleanup in Excel.

As Sharla Gelfand wrote in one of her recent talks:

“no one is handing out medals for figuring out regular expressions”

Cleanup in Excel–nothing to see here

library(readxl)
shesaid_full <- read_xlsx("thats_what_she_said.xlsx")
shesaid_full
id season episode scene deleted self speaker line_text next_speaker answer_text
2544 2 2 24 FALSE FALSE Jim No, thanks. I’m good. Michael That’s what she said. Pam?
2546 2 2 24 FALSE FALSE Pam Uh… my mother’s coming. Michael That’s what she sai [clears throat] Nope, but… Okay. Well, suit yourself.
2590 2 2 34 FALSE FALSE Michael And in the future, if I want to say something funny or witty or do an impression, I will no longer, ever, do any of those things. Jim Does that include ‘That’s What She Said’?
2593 2 2 34 FALSE FALSE Jim Wow! That is really hard. You really think you can go all day long? Well, you always left me satisfied and smiling, so… Michael THAT’S WHAT SHE SAID!
5324 2 10 2 FALSE FALSE Kevin [holds up the piece of tree he just cut off with a paper cutter] Well, sort of. Why did you get it so big? Michael A, that’s what she said, and B, I wanted it to be impressive. The biggest day of the year deserves the biggest tree of the year.
6321 2 12 33 FALSE FALSE Doctor Does the skin look red and swollen? Dwight That’s what she said.
6352 2 12 38 TRUE FALSE Oscar [Jim popping Michael’s bubble wrap cast] You should put butter on it. Michael Uh, that’s what she said. See, haven’t lost my sense of humor. No, no need, it was a non-stick grill.
7643 2 17 5 FALSE FALSE Dwight [eating grapes] Michael That’s what she said!
8871 2 21 22 FALSE FALSE Angela You already did me. Michael That’s what she said. [Jim mouths these words along with Michael] The thing is, Angela… you are in here an awful lot. You have complained about everybody in the office, except Dwight, which is odd because everyone else has had run ins with Dwight. Toby, by the way, what does ‘redacted’ mean? There is a file full of complaints in here marked ‘redacted’… ?
9623 3 1 48 FALSE TRUE Michael But you know what? Even if it didn’t, at least we put this matter to bed. Michael …that’s what she said. Or he said.
10903 3 5 59 FALSE FALSE Michael I mean, they’re just dough twisted up with some candy. They taste so good in my mouth. Stanley That’s what she said. [Stanley and Michael both laugh]
12593 3 10 49 FALSE FALSE Second Cindy Thanks! I, I wanna give you something. [She whispers in his ear. Michael starts to laugh] Michael Oh. That’s what she said.
13336 3 12 41 FALSE FALSE Michael I want you to think about your future in this company. I want you to think about it long and hard. Dwight That’s what she said.
14301 3 17 9 FALSE FALSE Jan Let’s just blow this party off. Michael That’s what she said.
14373 3 17 22 FALSE TRUE Jan Why is this so hard? Jan That’s what she said. Oh my God. What am I saying?
15385 3 20 11 FALSE TRUE Michael No, no. I need two men on this. Michael That’s what she said. No time! But she did. NO TIME! Guys, get on this. Dwight, I want you to be in charge of the press conference.
16405 3 22 68 FALSE FALSE Michael No mustard! No mustard! Just… eat it. Eat it, Phyllis. Dip it in the water so it will slide down your gullet more easily. Everyone That’s what she said!
17569 4 2 5 FALSE TRUE Michael Hey. Can you make that straighter? Michael That’s what she said.
18959 4 4 44 FALSE TRUE Michael And the best way to start is to hit start. And up comes the toolbar, Michael that’s what she said. What we have to do here is go to Run, and then you look up to PowerPoint. And we are in. We are going to register. You hit register— Updates are ready. I should update. Um, estimated time 12 minutes, so this should take 5 or 10 minutes.
20121 4 7 56 FALSE TRUE Michael That’s what I said. Michael That’s what she said.
20124 4 7 56 FALSE FALSE Michael I never know. I just say it. I say stuff like that, you know, to lighten the tension. When things sort of get hard. Jim That’s what she said.
20269 4 8 23 FALSE FALSE Lester And you were directly under her the entire time? Michael That’s what she said.
20271 4 8 23 FALSE FALSE Lester Excuse me? Michael That’s what she said.
20277 4 8 23 FALSE TRUE Michael Come again? Michael That’s what she said? I don’t know what you’re talking about.
20282 4 8 23 FALSE TRUE Deposition Reporter [reading off paper] Mr. Schneider: And you were directly under her the entire time? Mr. Scott: Deposition Reporter That’s what she said.
20715 4 9 19 FALSE FALSE Jan AND YOU’RE HARDLY MY FIRST! Michael THAT’S WHAT SHE SAID! [Jan gets an evil look on her face and picks up Michael’s dundie and throws it into his plasma screen tv] THAT IS A 200 DOLLAR PLASMA SCREEN TV YOU JUST KILLED! Good luck paying me back on your zero dollars a year salary plus benefits, babe! [Jan goes upstairs crying.]
21480 4 12 2 FALSE FALSE Dwight And… go. [Michael sticks his face in the cement] Force it in as deep as you can. Michael [muffled] That’s what she said.
23149 5 1 111 FALSE FALSE Jim Yeah, well, if you’re only free till three on Sunday and I can’t get there till one, then it’s gonna be pretty tight. Michael That’s what she said.
23910 5 4 30 FALSE FALSE Michael It squeaks when you bang it, Michael that’s what she said. Let’s hear it for me! Right? A bargain at any price!
24196 5 5 25 FALSE FALSE Holly Michael. Don’t. Don’t. Don’t make it harder than it has to be. Michael That’s what she said.
24754 5 6 27 FALSE FALSE Kelly Dwight, get out of my nook! Pam [in New York] That’s what she said! That’s what she said! That’s what she said!
28089 5 17 27 FALSE FALSE David Alright Dwight. This is huge. Dwight That’s what she said! [David laughs]
36370 6 18 9 FALSE FALSE Darryl You need to get back on top. Michael That’s what she said.
40586 7 8 29 FALSE FALSE Gabe Michael! You are making this harder than it has to be. Michael [grimacing] That’s what she said. [leaves]
42259 7 13 1 FALSE TRUE David No, no. No, comedy is a place where the mind goes to tickle itself. David That’s what she said. [laughs]. [hugs Michaels] Ohh.
43114 7 15 30 FALSE TRUE Holly I’m not saying it won’t be hard. But we can make it work. Holly That’s what she said.
44695 7 21 50 FALSE TRUE Michael [pulls out his mic from his shirt] This is gonna feel so good, getting this thing off my chest. [he hands them the body mic, when he speaks it is inaudible now] Michael That’s what she said! [waves goodbye and walks off to his gate, halfway there Pam comes running up to him and they hug for a while. They say their goodbyes to each other, and Michael walks off for good]
54087 9 5 29 FALSE FALSE Clark Wait! Wait. Hold on. Where’s the band? ’Cause there’s just no way you guys are making this magic with just your mouths. Creed Yeah. That’s what she said.
59750 9 23 68 FALSE FALSE Dwight [turns around] [whispering] Michael. I can’t believe you came. Michael That’s what she said.
links <- shesaid_full %>%
      count(speaker, next_speaker)
# zero-based index computed from the number of names rather than hard-coded
nodes <- tibble(name = unique(c(shesaid_full$speaker, shesaid_full$next_speaker)), 
                index = seq_along(name) - 1)
new_links <- left_join(links, nodes, by = c("speaker" = "name")) %>%
      select(speaker = index, next_speaker, n) %>%
      left_join(nodes, by = c("next_speaker" = "name")) %>%
      select(speaker, next_speaker = index, n)
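The two joins above translate speaker names into zero-based node positions. The same lookup can be sketched in base R with `match()`; `toy_pairs`, `toy_nodes`, and `toy_links` are made-up stand-ins for `shesaid_full`, `nodes`, and `new_links`, and the `source` and `target` column names are illustrative.

```r
# Made-up speaker pairs standing in for shesaid_full
toy_pairs <- data.frame(
  speaker      = c("Jan", "Jim", "Jan"),
  next_speaker = c("Michael", "Michael", "Jan"),
  stringsAsFactors = FALSE
)

# Node table: one row per distinct name, in order of first appearance
toy_nodes <- data.frame(
  name = unique(c(toy_pairs$speaker, toy_pairs$next_speaker)),
  stringsAsFactors = FALSE
)

# match() gives 1-based row positions in toy_nodes; subtract 1 for zero-based indices
toy_links <- data.frame(
  source = match(toy_pairs$speaker, toy_nodes$name) - 1,
  target = match(toy_pairs$next_speaker, toy_nodes$name) - 1,
  n      = 1
)
toy_links
```

A row whose source and target are equal, like the third one here, corresponds to someone responding to themselves.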

Tidy Tuesday demands a graph

In trying to find a way to visualize dialogue patterns, I happened upon the networkD3 package written by Christopher Gandrud et al. One visualization offered by the package is a Sankey diagram that shows nodes and links. I decided to give it a try.

Sankey diagram of Office characters who answered “that’s what she said”

Solid paths indicate first speaker on the left and TWSS response on the right
Dashed paths indicate the reverse
Loops indicate someone responding to themselves
library(networkD3)
sn <- sankeyNetwork(Links = new_links, Nodes = nodes, Source = "speaker",
      Target = "next_speaker", Value = "n", NodeID = "name",
      fontSize = 20, nodeWidth = 30, height = 600, width = 1000)
sn